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k-d Match: A Fast Matching Algorithm for Sheared Stellar 
Samples 
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ABSTRACT 

This paper presents new and efficient algorithms for matching stellar catalogues where 
the transformation between the coordinate systems of the two catalagoues is unknown 
and may include shearing. Finding a given object whether a star or asterism from the 
first catalogue in the second is logarithmic in time rather than polynomial, yielding 
a dramatic speed up relative to a naive implementation. Both acceleration of the 
matching algorithm and the ability to solve for arbitrary affine transformations not 
only will allow the registration of stellar catalogues and images that are now impossible 
to use but also will find applications in machine vision and other imaging applications. 
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1 INTRODUCTION 

Finding the correspondances between catalogues of objects 
has long been a goal of astronomy. Generally, one first must 
place both catalgoues on the same coordinate system and 
then search in two dimensions for the nearest neighbour in 
the second catalogue of each object in the first catalogue. 
Often one does not know the coordinate transformation be- 
tween the catalogues initially, so part of the first step is 
to determine this transformation. Finding the transforma- 
tion between the coordinate systems also basicially involves 
a search for objects that correspond to each other i n each 
catalg oue and are invariant under the transformation. iGrothl 
II 19861 ) introduced the technique of looking for similar trian- 
gles in two catalogues. The property of similarity is invari- 
ant under translation, rotation, magnification and inversion. 
The algorithm outlined in this paper uses the ratio of sides 
as the invari a nt un der the coordinate transformation as in 
IValdes et al.l (|199a ) and searches for several triangles with 
similar transformations. In the end even a small catalogue 
can result in a large number of triangles to search in a cat- 
alogue: N(N - 1)(JV - 2)/6 or on the o rder of NJN 2 com - 
parisions to make. Several authors (e.g. IValdes et al.lll995T ) 
have outlined techniques to accelerate the calculation of 
the triangles (0(N 2 ) vs. C(iV 3 )) at the expense of stor- 
ing information and accelerating the search process which 
decreas es the prefactor on 0(Nf JVf ) by presorting the tri- 
angles (|Valdes et al.lll995J). weighting the triangles by the 
magnitudes of the stars (Scholl 1994), culling the triangles 
to compare (|Pal fc Bakos 2006) or quitting after only a frac- 
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tion of the triangles ha ve been com pared and a sufficiently 
good fit is determined (|Tabuj|2007l ). 

Both finding the matching triangles and later the 
matching objects in the catalogues require finding neigh- 
bours in a two-dimensional space. If the space were one di- 
mens ional, one wou ld use a binary tree to search in log N 
time. lBentlevI (|1975l ) developed a generalisation of the binary 
tree for arbitrary number of dimensions, the fc-d tree. In this 
case fc equals two. This algorithm dramatically speeds the 
search over the two dimensi ons, and runs the search for the 
astrometry.net algorithm (|Lang et al.ll2010l ). In particular, 
the process of finding the nearest neighbour in the two cat- 
alogues is sped up from 0(N 1 N 2 ) to O [{N\ + N 2 ) logJV 2 ]. 
For the triangle search the improvement is even more dra- 
matic from <D(NlNl) to O [(iVf + iV 2 3 ) log N 2 ] . 

The dramatic acceleration of the determination of the 
transformation encou rages a gener alisation of the triangle 
matching technique of lGrothl (1 19861 ). In particular the prop- 
erty of similarity of triangles is invariant under translation, 
rotation, magnification and inversion but not shearing. If 
there is an significant shearing between the two coordinate 
systems, a triangle matching algorithm will fail. However, it 
is straightforward to generalise this techinque for a general 
affine transformation. Such a transformation will preserve 
the ratio of areas, so the new technique is to build quadrilat- 
erals from sets of four objects in each catalogue and calculate 
the ratio of areas of the triangles that comprise the quadri- 
laterals. Only two of the three area ratios are independent 
so the final search is again two dimensional. However, the 
number of quadrilaterals is huge N(N-l)(N-2)(N-3)/24 
or on the order of N^N 2 direct comparisons to make. The 
fc-d tree accelerates this quadrilateral search dramatically to 
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O UNi + N 2 ) log N2] , so it is even faster than the custom- 
ary direct search over triangles. 

The following sections introduce the fc-d tree data struc- 
ture (§[2]), the quadrilateral search (§|3J|, the elimination of 
false positives (§|1J, present results of these new techniques 
(§0 an d discuss future applications (§|BJ. 



2 THE fc-D TREE 



iBentlevi (| 19751 ) introduced the fc-d tree as a multidimensional 
generalisation of a binary tree. In a binary tree each node 
contains its data, a key and two pointers one toward the 
daughter with a larger value of the key and one toward the 
daughter with a smaller value of the key. In the fc-d tree, 
the key is multidimensional. In the current case the key is 
two-dimensional. Again each node contains its data, its two 
keys and two pointers one toward the daughter with a larger 
value of the first key and one toward the daughter with a 
smaller value of the first key. Looking at one of the daugh- 
ter nodes, it contains pointers toward its daughter with a 
larger value of the second key and one toward the daughter 
with a smaller value of the second key. In this way each node 
divides its portion of its space in two with alternatively hor- 
izontal and vectical lines. The fc -d tree is straig htforward to 
generalise to more dimensions. IBentlevi l| 19751) showed the 
construction and the optimization of the tree requires on 
order of N log 2 N operations and searching the tree requires 
on order of log 2 N operations where N is the number of 
nodes in the tree. 

A possible alterna tive data structure is a quadtree 
IjFinkel fc BentlevHl974r ) in which each node has four chil- 
dren splitting the node's space into four quadrants. This 
has the disadvantage of that many more of the pointers 
within the data structure will be null than with a fc-d tree. 
Furthermore, although the quadtree may be generalised to 
higher dimensions, to implement it efficiently in various 
numbers of dimensions may require substantially different 
algorithms. On the other hand, in a fc-d tree one can alter- 
nate over however many dimensions that the key requires; 
the false-positive search (§0} uses a three-dimensional tree. 
lLang et al.l i|2010f ) use a fc-d tree to store four-dimensional 
keys representing the shape of four-star asterisms through- 
out the sky. They resort to a larger asterism because over the 
entire sky there would be many near misses for triangular 
aste risms given the typical positional errors in astronomy. 

iLana (2009J) has developed a very memory efficient fc-d 
tree algorithm for use with astrometry.net, and it is pub- 
licly available. This implementation does not use pointers 
between the nodes, but rather stores the nodes in an array 
such that there is a straightforward one-to-one correspon- 
dance between the position in the array and the location in 
the tree structure. However, matching small fields will be 
limited to hundred of stars and possibly millions of trian- 
gles, so memory efficiency is not crucial. The algorithm pre - 
sented here uses the implementation of iTsiombikaa l|2011f ). 
available on Google Code. For larger datasets, a more mem- 
ory efficient implementation may be helpful and could easily 
replace the library used here. 




Figure 1. A quadrilateral divided into two triangles in two dif- 
ferent ways by the dotted lines. On the left are triangles A and 
D, and on the right are B and C. The areas of each triangle are 
given. The vertices are labeled in order of the area of the triangle 
to which they are adjacent. 



3 THE QUADRILATERAL SEARCH 

The efficient fc-d tree data structure begs for more co mpli- 
cated problems to solve. In particular lLang et al.l l|2010f l de- 
cided to use four-star asterisms to reduce the number of false 
positives in their search throughout the sky to find a particu- 
lar star field. They developed a four-component key for each 
quadrilateral that is invariant under translations, rotations, 
scalings and inversion. For a triangle, one can only obtain 
a two-component key, for example the ratio of each of the 
smaller sides to the longest side. This improvement is sinrple 
to visualise when one considers a quadrilateral as the union 
of two triangles. Here the challenge will not be to find a par- 
ticular field somewhere in the sky, but to match a field with a 
given catalogue where the transformation between the coor- 
dinate systems may include shearing as well, i.e. to find the 
general affine transformation that relates the two coordinate 
systems. A general affine transformation does not preserve 
angles, t he relative l e ngth of sides or the geometric hash de- 
vised by lLang et al.l ((2010J). However, since the affine trans- 
formation applies the same Jacobian throughout the field, 
the ratio of areas of shapes will be preseved. In particular 
the ratio of the areas of the various triangles that comprise 
a quadrilateral will be preserved. Given the coordinates of 
each of the vertices of the quadrilateral it is straightforward 
to calculate the area of the triangle that spans three ver- 
tices by a cross product. Because one can chose to leave out 
a vertex from each triangle, each quadrilateral will generate 
four triangles as shown in Fig. [T] By sorting these triangles 
in area, the area of each triangle is normalized by that of 
the largest triangle. This would appear at first to yield three 
ratios B/A, C/A and D/A where A, B, C and D are the ar- 
eas of the triangles in decreasing order. However, the sum 
of the areas B + C equals A + D (both pairs make up the 
entire quadrilateral), so the ratio D/A is not independent 
and equals D/A = B/A + C/A — 1, so only the two areas 
B/A and C/A are used to represent the quadrilateral. 



4 FALSE POSITIVES 

Even a small catalogue will generate a large number of as- 
terisms (triangles and quadrilaterals), and there are likely to 
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be many near misses due to coincidences and observational 
errors. It is straightforward to query the k-d tree to get all 
the asterisms whose keys lie within a sphere surrounding 
a particular value. Because the keys used in this algorithm 
vary between zero and one, it is natural to select a tolerance 
appropriate for the errors in one's data (10 for the trian- 
gle search and 3 x 10 -3 for the quadrilateral search). Larger 
tolerances will yield more matching asterisms, so eliminat- 
ing false positives is crucial. To identify the best matching 
asterisms, the transformation between the coordinate sys- 
tems defined by the matching asterism is determined. It has 
the following form 



x 2 — axi + byx + c,y 2 — dxi + eyi + f. 



(1) 



If no shearing is present, a — e and b — —d, so the values of 
a, b, c and / can uniquely determine the transformation. The 
parameters of the transformation are determined by least- 
squares fitting over the points of the matching triangles or 
quadrilaterals and can include shear. The values of a, b, d 
and e are typically of order unity while c and / account for 
translations and may be large; therefore, for each matching 
asterism, the points that comprise the asterism in each cat- 
alogue are stored in a second k-d tree with the values of a, b 
and c/1000 serving as keys. The value of c must be scaled by 
a typical value that one expects for the translation, so that 
the three values within the k-d tree are of similar order. 

When a new matching asterism is found, its values of a, 
b and c are compared with those of the previous matches. If 
a previous match or matches are found with similar values 
of a, b and c/1000 (within a distance of 10 -3 in the space 
of a, b, c/1000), the coordinate transformation for all the 
points in all the matching asterisms with similar values of 
a,b and c is calculated. If the values of a, b and c for this 
transformation are similar to those for the individual aster- 
isms, it is likely that these are true matches. No effort is 
made to cull repeated stars from the ensemble of matching 
asterisms; therefore, stars that are members of several aster- 
isms have a larger weight in the fit for the transformation. 
These repeated stars do not appear multiple times in the 
final matched star lists. Because this false positive detec- 
tion scheme also relies on a k-d tree but with fewer entries 
(only the matching asterisms), it adds little overhead and in- 
creases the confidence and the accuracy in determining the 
transformation between the catalogues. The factor by which 
the x— translation parameter c is scaled can be given by the 
user. Furthermore, if a single transformation can account for 
at least 20 asterisms, the algorithm stops and outputs this 
best fitting transformation. Again the number of asterisms 
to count before quitting can be changed by the user. 



5 RESULTS 

The k-d match algorithm is used to find the correspondances 
between two stellar catalogues generated by Hubble Space 
Telescope (HST) data and to connect these catalogues to the 
Two Micron All Sky Survey (2MASS) cat alogue, yielding 
absol u te astrometry in th e 2MASS system IjSkrutskie et al.l 
12006. ) . iBedin et all (|2009r ) observed a field in the globular 
cluster Messier 4 to probe the faint end of the white-dwarf 
cooling sequence and obtained measurements in the HST 
bands F606W and F814W using the Advanced Camera for 
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Figure 2. The colour magnitude diagram for three overlapping 
samples in the globular cluster M4. In the upper panels, the green 
points are a visual sample, and the red points are a IR sample. 
The portion of the visual sample used to determine the coordinate 
transformation lies between the two green lines. The portion of 
the IR sample used lies above the red line. The lower panels depict 
the overlapping samples used to connect the observations in the 
visual bands (left) to the 2MASS observations (right). 



Survey (ACS). Additional data in these bands were obtained 
as part of the program GO-9578 (PI: Rhodes). The second 
set of data is new from recent HST observations by Dieball 
and collaborators in F110W and F160W using the Wide- 
Field Camera 3 (WFC3). The upper panel Fig. [2] depicts 
the colour-magnitude diagram of the two catalogues. The 
catalogue of objects detected in the visual bands is deeper, 
spans a larger area and also contains brighter stars than 
the IR catalogue, so taking the brighest thirty stars in each 
catalogue failed to find matches; therefore, to find the pos- 
sible correspondences, the IR colours and magnitudes are 
plotted adjacent to those of the visual bands to line up the 
turn-off in the two samples. The brightest 25 stars from the 
IR sample are matched against the 730 stars in the visual 
sample whose magnitudes lie between 13.5 and 16. There are 
2,300 triangles in the IR sample and 64,569,960 in the visual 
sample. No effort is made to cull triangles from the either 
sample. It is most efficient to build the k-d tree from the 
smaller sample; this is what the package does by default. A 
direct comparison of all the triangles in the two catalogues 
numbers 148,510,908,000 and required about three hours on 
a laptop computuer. The fc-d tree required forty seconds on 
the same computer. Of course, one could spend some time to 
cull the second list of stars to perhaps thirty or so and per- 
form the direct comparison but this would have to be done 
carefully because the overlap of the two fields is partial and 
would probably require more than thirty seconds work. 

The second step is to find the transformation between 
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Figure 3. The positions of the stars in the three overlapping 
samples in the globular cluster M4. The green points are the visual 
sample, the red points are the IR sample, and black points are 
point sources in the 2MASS catalogue. 



the visual sample and the 2MASS coordinate system using 
objects found in the 2MASS Point Source catalogue within 
five arcminutes of the centre of M4 as depicted in the lower 
right panel of Fig. [2] There are 88 stars in the visual sam- 
ple and 354 in the 2MASS PSC. Finding the best match- 
ing transformation between the two coordinate systems re- 
quires about six seconds. Both of these calculations show 
the power of the fc-d match algorithm. In both cases a large 
fraction of the stars in each catalogue are unique. They don't 
appear in both catalogues. Finding the matching stars re- 
quires a more exhaustive search than simply comparing say 
the bright est thirty stars in e ach catalogue. Although DAO- 
MATCH l)Valdes et al.lll995h was not designed for matching 
catalogues with small fractional overlap, DAOMATCH was 
used to try to find the best transformations between these 
catalogues to provide a benchmark against a standard rou- 
tine. Because of the lack of overlap, DAOMATCH failed to 
find the correspondences and generally stopped after com- 
paring the first thirty stars in each catalogue. This exhaus- 
tive search typically takes about 0.5 seconds. A similar ex- 
haustive search with fc— d match takes 0.007 seconds, about 
a factor of one hundred faster. 

The triangle matching determines the coordinate trans- 
formations among the IR sample, the visual sample and the 
2MASS catalogue. Fig. [3] depicts the locations of the stars 
in the IR sample in the coordinate system of the 2MASS 
Point Source Catalogue. The total IR sample contains 2,192 
stars, the deeper and wider visual sample contains 13,122 
stars and the 2MASS PSC sample contains 3761 stars. The 
second step to merge the catagoues is also dramatically ac- 
celerated. To test the fc-d match algorithm against a variety 
of catalogue sizes, each catalogue is copied onto itself ten or 
one hundred times and the nearest object in the second cat- 
alogue is determined for each object in the first catalogue. 
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Figure 4. The total runtime to match the objects in the IR 
catalogue to those in the visual catalogue as a function of the 
number of objects in each catalogue. The red circles trace the 
results for the direct technique, and the green triangles give the 
k-d tree algorithm. 



Other criteria are possible. For example, each object in the 
first catalogue can be associated with all the objects within 
a given search radius in the second catalogue. Fig. [4] depicts 
the runtime for each of the algorithms as a function of the 
number of objects in each catalogue. The left panel gives 
the runtime as a function of the product of the number of 
the objects in each catalogue, and the runtime of the direct 
method is approximately proportional to this product. The 
right panel gives plots the runtime against the expected log- 
arithmic dependence of runtime for the fe-d tree algorithm. 
In both cases the straight lines give a linear relation between 
the two axes, normalized by the runtime of the smallest pair 
of catalogues. Even for relatively small catalogues of a thou- 
sand obj ects, the fc-d tr ee algorithm outperforms the direct 
method. iBentlevI (|l9T5T) argue the fc-d tree search should 
perform better than a direct search as along as there are 
more than 2 (i.e. four) nodes in the tree. 

The second test uses same stars as the test for the tri- 
angle algorithm. In particular a random affine transforma- 
tion is applied to the coordinates of the brightest 25 stars 
of the IR sample, and this transformed set of coordinates 
is matched against the original coordinates. The triangle 
algorithm fails to find any matching triangles among the 
2,300 asterisms and requires about 5 ms to run. The quadri- 
lateral matching technique (25,300 asterisms) takes signifi- 
cantly longer, about 50 ms but consistently finds the correct 
transformation . 



6 DISCUSSION 

This paper has outlined a new efficient algorithm for match- 
ing stellar catalogues or lists of coordinates in general. For 
large lists, the speed-up relative to the generally available 
techniques that perform a naive direct comparison are dra- 
matic. This speed-up allows the use of larger source lists 
to accommodate catalogues in vastly different bands and 
where one expects only a partial overlap between the lists 
of objects. The efficiency of the algorithm yields an efficient 
implementation of the more difficult problem where the two 
coordinate systems differ by an arbitrary affine transforma- 
tion; that is when shearing is present. This may find appli- 
cation for astronomical catalogues where the instrumenta- 
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tional shearing has not been corrected or is unknown. Ad- 
ditionally, this quadrilateral match algorithm could also be 
applied more generally. For example, if one ignores perspec- 
tive the rotation and projection of markers on or within 
three-dimensional object into a two-dimensional image is 
equivalent to a general affine transformation with a only 
few caveats when the object is transparent. Therefore, the 
quadrilateral matching algorithm could be useful for appli- 
cations as diverse as facial recognition and the registration 
of medical and satellite imagery. 

All of the routines outlined in this paper are 
available under the "New BSD" license at Source- 
forge (https://sourceforge.net/projects/kdmatch). The 
reader is encouraged to experiment with his or her data. 
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